Group Project - Startup Analysis¶

Library¶

In [1]:
import pandas as pd
import numpy as np
from PIL import Image
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
import plotly.express as px
from wordcloud import WordCloud
import pycountry

Context¶

A startup is a newly established company or project founded by an entrepreneur to explore innovative ideas and create a scalable business model. The success of a startup is highly uncertain and comes with significant risks. Many startups face the possibility of failure due to intense competition, market fluctuations, financial challenges, and other obstacles. However, a minority of startups do manage to overcome these hurdles and achieve remarkable success, becoming influential players in their respective industries. The journey of a startup is characterized by its adventurous spirit, where the pursuit of success is accompanied by the acknowledgment of potential failure, making it a challenging yet rewarding endeavor for ambitious entrepreneurs.

In [2]:
im = Image.open("assets/Startup-Funding-Series.jpg")
display(im)

Startups have a profound impact on the economy, fostering innovation, job creation, and technological advancements. These entrepreneurial ventures inject fresh ideas and disrupt traditional industries, driving competition and spurring economic growth. However, startups also entail inherent risks, as many of them face failure, leading to potential financial losses and missed opportunities. Predicting whether a startup will succeed or fail is crucial for several reasons. Firstly, it allows investors and stakeholders to make informed decisions, allocating resources efficiently. Secondly, predicting success or failure can help guide government policies and support programs to nurture promising startups and mitigate risks. Additionally, understanding the factors that contribute to success or failure aids entrepreneurs in making strategic adjustments and increasing their chances of building sustainable and influential businesses, ultimately contributing to the overall health and vibrancy of the economy.

Problem Statement¶

The task at hand involves predictiong the likelihood of success or failure for a given startup. The challenge is to develop a comprehensive and accurate predictive model that takes into account various factors and variables influencing startup outcomes.

In [3]:
df = pd.read_csv("data_input/investments_VC.csv", encoding = 'unicode_escape')
df.rename(columns = {' market ': 'market', ' funding_total_usd ': 'funding_total_usd'}, inplace = True)

df.head()
Out[3]:
permalink name homepage_url category_list market funding_total_usd status country_code state_code region ... secondary_market product_crowdfunding round_A round_B round_C round_D round_E round_F round_G round_H
0 /organization/waywire #waywire http://www.waywire.com |Entertainment|Politics|Social Media|News| News 17,50,000 acquired USA NY New York City ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 /organization/tv-communications &TV Communications http://enjoyandtv.com |Games| Games 40,00,000 operating USA CA Los Angeles ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 /organization/rock-your-paper 'Rock' Your Paper http://www.rockyourpaper.org |Publishing|Education| Publishing 40,000 operating EST NaN Tallinn ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 /organization/in-touch-network (In)Touch Network http://www.InTouchNetwork.com |Electronics|Guides|Coffee|Restaurants|Music|i... Electronics 15,00,000 operating GBR NaN London ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 /organization/r-ranch-and-mine -R- Ranch and Mine NaN |Tourism|Entertainment|Games| Tourism 60,000 operating USA TX Dallas ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 39 columns

Features Explanation¶

  • permalink: A unique URL-friendly identifier for the startup/company.
  • name: The name of the startup/company.
  • homepage_url: The URL of the startup/company's website.
  • category_list: The list of categories or industries the startup/company operates in.
  • market: The target market or industry focus of the startup/company.
  • funding_total_usd: The total amount of funding the startup/company has received in USD.
  • status: The current status of the startup/company (e.g., active, acquired, closed).
  • country_code: The country where the startup/company is based, represented by its country code.
  • state_code: The state or province where the startup/company is based, represented by its state code.
  • region: The region or geographical area where the startup/company is located.
  • city: The city where the startup/company is located.
  • funding_rounds: The total number of funding rounds the startup/company has gone through.
  • founded_at: The date when the startup/company was founded.
  • founded_month: The month when the startup/company was founded.
  • founded_quarter: The quarter when the startup/company was founded.
  • founded_year: The year when the startup/company was founded.
  • first_funding_at: The date of the first funding round the startup/company received.
  • last_funding_at: The date of the most recent funding round the startup/company received.
  • seed: The amount of funding received in the seed funding round.
  • venture: The amount of funding received in the venture funding round.
  • equity_crowdfunding: The amount of funding received through equity crowdfunding.
  • undisclosed: The amount of undisclosed funding received.
  • convertible_note: The amount of funding received through convertible notes.
  • debt_financing: The amount of funding received through debt financing.
  • angel: The amount of funding received from angel investors.
  • grant: The amount of funding received in the form of grants.
  • private_equity: The amount of funding received from private equity sources.
  • post_ipo_equity: The amount of funding received after the initial public offering (IPO) in the form of equity.
  • post_ipo_debt: The amount of funding received after the initial public offering (IPO) in the form of debt.
  • secondary_market: The amount of funding received through secondary market transactions.
  • product_crowdfunding: The amount of funding received through product crowdfunding.
  • round A-H: Rounds startup financing
In [4]:
im2 = Image.open("assets/Graph-Showing-The-Various-Stages-Of-Startup-Investment-PowerPoint-Slides.png")
display(im2)
In [5]:
im3 = Image.open("assets/Medium.png")
display(im3)

Data Cleaning¶

Categorical¶

In [6]:
df.shape
Out[6]:
(54294, 39)
In [7]:
CATEGORICAL = list(filter(lambda x: df[x].dtype=='object',df.columns))
def show_stats(columns):
    stat = {}
    for col in columns:
        stat[col] =  [df[col].nunique(), 
                      df[col].isna().mean()*100, 
                      df[col].dtype]
    return pd.DataFrame.from_dict(stat, orient='index', columns=['Unique', 'Missed values in %', 'DType'])
show_stats(CATEGORICAL)
Out[7]:
Unique Missed values in % DType
permalink 49436 8.943898 object
name 49350 8.945740 object
homepage_url 45850 15.296350 object
category_list 16675 16.239363 object
market 753 16.252256 object
funding_total_usd 14617 8.943898 object
status 3 11.364055 object
country_code 115 18.655837 object
state_code 61 44.448742 object
region 1089 18.655837 object
city 4188 20.208494 object
founded_at 3369 28.990312 object
founded_month 420 29.122923 object
founded_quarter 218 29.122923 object
first_funding_at 3914 8.943898 object
last_funding_at 3657 8.943898 object

State code column can be dropped as it doesn't really play a big role, but has too much missed values We can remove everything connected with links, as there are too many unique values for them. Also remove category list, because we already have Market column

In [8]:
df = df.drop('state_code', axis=1)
df = df.drop(['permalink', 'homepage_url', 'category_list'], axis=1)

funding_total_usd should be converted from str to int

In [9]:
df['funding_total_usd'].head(2)
Out[9]:
0     17,50,000 
1     40,00,000 
Name: funding_total_usd, dtype: object
In [10]:
df.drop(df[df['funding_total_usd'] == ' -   '].index,inplace=True)
df['funding_total_usd'] = df['funding_total_usd'].str.replace(",","")
df.dropna( axis=0,subset=['funding_total_usd'],inplace = True) and df.dropna( axis=0,subset=['market'],inplace = True)
df['funding_total_usd']=df['funding_total_usd'].astype(np.int64)
marketdf = df[['market','name','funding_total_usd','status']]

Numerical¶

In [11]:
show_stats(list(filter(lambda x: df[x].dtype =='float64', df.columns)))
Out[11]:
Unique Missed values in % DType
funding_rounds 17 0.000000 float64
founded_year 101 21.443763 float64
seed 3337 0.000000 float64
venture 9300 0.000000 float64
equity_crowdfunding 252 0.000000 float64
undisclosed 687 0.000000 float64
convertible_note 299 0.000000 float64
debt_financing 1872 0.000000 float64
angel 999 0.000000 float64
grant 532 0.000000 float64
private_equity 847 0.000000 float64
post_ipo_equity 239 0.000000 float64
post_ipo_debt 57 0.000000 float64
secondary_market 20 0.000000 float64
product_crowdfunding 176 0.000000 float64
round_A 2035 0.000000 float64
round_B 1269 0.000000 float64
round_C 740 0.000000 float64
round_D 458 0.000000 float64
round_E 225 0.000000 float64
round_F 110 0.000000 float64
round_G 32 0.000000 float64
round_H 5 0.000000 float64
In [12]:
df = df.dropna()
df = df.drop_duplicates()
print('Length after cleanup', len(df))
Length after cleanup 27970

Decode Countries¶

In [13]:
def code_to_country(text: str):
    if text == 'ROM':
        text = 'ROU'
    return pycountry.countries.get(alpha_3=text).name

df['country'] = df['country_code'].apply(code_to_country)
df.drop('country_code', axis=1, inplace=True)
In [14]:
df['country'].head()
Out[14]:
0      United States
2            Estonia
3     United Kingdom
4      United States
10             Chile
Name: country, dtype: object

Statistic Peak¶

In [15]:
df.describe()
Out[15]:
funding_total_usd funding_rounds founded_year seed venture equity_crowdfunding undisclosed convertible_note debt_financing angel ... secondary_market product_crowdfunding round_A round_B round_C round_D round_E round_F round_G round_H
count 2.797000e+04 27970.000000 27970.000000 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 ... 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04 2.797000e+04
mean 1.774500e+07 2.007866 2007.064140 3.044336e+05 1.065874e+07 7.579419e+03 1.051359e+05 3.182541e+04 2.595650e+06 8.701276e+04 ... 4.062824e+04 1.000055e+04 1.686132e+06 2.188271e+06 1.860551e+06 1.196603e+06 5.584919e+05 2.877252e+05 9.824156e+04 2.499106e+04
std 1.955124e+08 1.509339 7.380901 1.077503e+06 3.572890e+07 1.960442e+05 2.357928e+06 1.838252e+06 1.819967e+08 6.047256e+05 ... 3.100436e+06 5.581497e+05 5.953910e+06 9.238159e+06 1.009298e+07 1.283717e+07 7.013485e+06 8.319132e+06 6.966990e+06 3.611924e+06
min 3.000000e+01 1.000000 1902.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 4.500000e+05 1.000000 2005.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
50% 2.276304e+06 1.000000 2009.000000 0.000000e+00 9.295685e+05 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
75% 1.100000e+07 2.000000 2011.000000 1.286378e+05 8.698066e+06 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
max 3.007950e+10 18.000000 2014.000000 1.000000e+08 2.351000e+09 1.700000e+07 2.508000e+08 3.000000e+08 3.007950e+10 3.025439e+07 ... 4.000000e+08 7.200000e+07 2.250000e+08 5.420000e+08 4.900000e+08 1.200000e+09 4.000000e+08 1.060000e+09 1.000000e+09 6.000000e+08

8 rows × 24 columns

In [16]:
print('Mean total funding (in $):',df['funding_total_usd'].mean())
print('Median total funding (in $):',df['funding_total_usd'].median())
print('Mean year founded:',round(df['founded_year'].mean()))
print('The oldest company: ', int(min(df['founded_year'])))
print('The latest company: ', int(max(df['founded_year'])))
print('Median seed (first official investment round): ', df['seed'].std())
Mean total funding (in $): 17744997.722130854
Median total funding (in $): 2276304.0
Mean year founded: 2007
The oldest company:  1902
The latest company:  2014
Median seed (first official investment round):  1077502.5648221637

EDA¶

In [17]:
top_spheres = df['market'].value_counts()[:10]
fig = px.pie(values = top_spheres,
            names = top_spheres.index,
            title = 'Top 10 most expensive markets');
fig.show()
In [18]:
plt.figure(figsize=(28,8))
sns.countplot(x = 'market',
              data = marketdf,
              order = marketdf['market'].value_counts().iloc[:15].index)
plt.xlabel('Market')
plt.ylabel('Counts')
plt.title('Amount of startup by Markets');
In [19]:
mark=marketdf['market']
mark.dropna(inplace=True)
wordcloud = WordCloud(background_color = 'white',
                      width = 5000,
                      height = 3000).generate(" ".join(mark))
plt.figure(figsize=(20,10))
plt.imshow(wordcloud,interpolation='bilinear') 
plt.axis('off') 
plt.show()
In [20]:
df_filtered = df[df['status'] != 'operating']

fig = px.histogram(df_filtered['status'], 
                   title='Startups status')
fig.update_xaxes(categoryorder = 'total ascending')
fig.show()

From the above result, we can see that most of the companies either end up being acquired or succeed since their inception as startups.

In [21]:
rounds = {}
for i in df.columns:
    if 'round_' in i:
        rounds[i] = df[i].mean()/10**6
fig = px.bar(x = rounds.keys(),
             y = rounds.values(),
             labels = {'y': 'Investment ($ million)',
                       'x': 'Rounds'});

fig.show()

Each round, a startup companies get less and less funding - because it can make profit for it self.

In [22]:
fig = px.histogram(df, 
                  x = 'founded_year',
                  labels = {'founded_year': 'Founded year'},
                  title = 'Startups founded years distribution')
fig.show()

Most of the startups were founded in the 2000s. The peak of startup establishments occurred in the year 2012, with a total of 3,534 startups founded.

In [23]:
fig = px.histogram(x = ['angel', 'grant', 'venture'],
                   y = [len(df[df['angel']!=0]),
                       len(df[df['grant']!=0]),
                       len(df[df['venture']!=0])],
                   title = 'Startups invesetment sources');
fig.update_xaxes(categoryorder = 'total ascending')
fig.show()

This is obvious, because venture capitals are porvided by professional investors and give startups way more funding.

In [24]:
top_countries = df['country'].value_counts()[:10]
top_countries_df = top_countries.reset_index()
top_countries_df.columns = ['Country', 'Count']
top_countries_df = top_countries_df.sort_values(by='Count', ascending=True)
top_countries_df
Out[24]:
Country Count
9 Ireland 210
8 Spain 354
7 Germany 473
5 Israel 476
6 China 476
4 India 500
3 France 544
2 Canada 861
1 United Kingdom 1574
0 United States 19097
In [25]:
fig = px.bar(top_countries_df,
                  x = 'Count',
                  y = 'Country',
                  orientation = 'h',
                  title = 'Top 10 Countires with the Most Startups',
                  labels = {'Count': 'Number of startups: ',
                           'Country': 'Country: '})
fig.show()

As we see, US is the most popoular region for start a startup journey.

In [26]:
df.head()
Out[26]:
name market funding_total_usd status region city funding_rounds founded_at founded_month founded_quarter ... product_crowdfunding round_A round_B round_C round_D round_E round_F round_G round_H country
0 #waywire News 1750000 acquired New York City New York 1.0 2012-06-01 2012-06 2012-Q2 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 United States
2 'Rock' Your Paper Publishing 40000 operating Tallinn Tallinn 1.0 2012-10-26 2012-10 2012-Q4 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Estonia
3 (In)Touch Network Electronics 1500000 operating London London 1.0 2011-04-01 2011-04 2011-Q2 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 United Kingdom
4 -R- Ranch and Mine Tourism 60000 operating Dallas Fort Worth 2.0 2014-01-01 2014-01 2014-Q1 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 United States
10 1,2,3 Listo E-Commerce 40000 operating Santiago Las Condes 1.0 2012-01-01 2012-01 2012-Q1 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Chile

5 rows × 35 columns

In [27]:
top_funding = df[['country','funding_total_usd']]
top_funding = top_funding.groupby('country')['funding_total_usd'].sum().reset_index()
top_funding = top_funding.sort_values(by = 'funding_total_usd', ascending = False).head(10)
top_funding
Out[27]:
country funding_total_usd
100 United States 385898175119
20 China 21937000804
99 United Kingdom 16358468613
42 India 11596801424
17 Canada 9957982897
34 Germany 6855077419
45 Israel 5402961851
33 France 3750926917
87 Spain 3341128543
64 Netherlands 2969230977
In [28]:
fig = px.bar(top_funding,
             x = 'funding_total_usd',
             y = 'country',
             title = 'Top 10 countries by total funding',
             labels = {'x': 'Total fund (4 billions)',
                       'y': 'Country: '
                      })
fig.show()

The answer is yes. Unites states is also the biggest funded startups.

how about the companies?

In [29]:
top_companies = df.sort_values(by=['funding_total_usd'], ascending = False) [:15]

fig = px.bar(data_frame = top_companies,
             y = 'funding_total_usd',
             x = 'name',
             title = 'Top 15 most succesfull companies',
             labels = {'x': 'Total funding: ',
                       'y': 'Company: '}
            );

fig.show()

Correlation¶

In [30]:
fig = px.imshow(df.corr());
fig.show();
C:\Users\KeisezrG\AppData\Local\Temp\ipykernel_9720\1654500210.py:1: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

The Correlation between debt_financing and funding_total_usd. Let's inspect more about that variable.

In [31]:
df_filtered = df[(df['debt_financing']>0)&(df['funding_total_usd']/10**9 < 30)]

sns.scatterplot(data = df_filtered,
                x = 'funding_total_usd',
                y = 'debt_financing',
                marker = 'o'
               )
sns.regplot(data=df_filtered, 
            x='funding_total_usd', 
            y='debt_financing', 
            scatter=False)

plt.xlabel('Funding Total (USD)')
plt.ylabel('Debt Financing')
plt.title('Scatter Plot without Outliers')
plt.show()

The scatter plot indicates the more money you need for the startup, the more debt_financing you'll need.

Hypothesis: the seed we raise depends on the founded quater

are there any 'good' quaters to start first fund raising?

In [32]:
df['founded_quarter'].head()
Out[32]:
0     2012-Q2
2     2012-Q4
3     2011-Q2
4     2014-Q1
10    2012-Q1
Name: founded_quarter, dtype: object

Extract quater into separate solumn first

In [33]:
def separate_quarter(text: str):
    return text.split('-')[-1]

df['quater'] = df['founded_quarter'].apply(separate_quarter)
In [34]:
df_by_q_sort = df.sort_values(by='quater')
num_seed = []
for q in df_by_q_sort['quater'].unique():
    num_seed.append(len(df_by_q_sort[(df_by_q_sort['quater']==q)&(df_by_q_sort['seed'] > 0)]))
fig = px.histogram(
    y=num_seed, 
    x=df_by_q_sort['quater'].unique(),
    title='Quater x Seed popularity plot',
    color_discrete_sequence=["teal"],
    labels={'x': 'Quater the startup was founded at', 
    'y': 'Number of seeds raised'})
fig.show()

The biggest seed raised is Q1 and Q1 is also the most popular option among investors. The visualization above suggests that the first quarter is the most popular time for startups to secure their initial funding, supporting the validity of the hypothesis. Therefore, it is advisable to apply for seed funding during that period.